Solve Adok's quiz and more with dynamic programming

Bonz

Adok's quizzes have become a constant appointment in the last few issues of Hugi. They have ranged from easy ones, to those requiring some (basic) knowledge of trigonometry or physics, to those that can be solved with some intuition, to those that need a computer and some good algorithmic thinking to be solved. There was actually just one of the last kind -- the hard quiz in Hugi 25 -- and it was this one that prompted me to write this article.

The quiz, in short, goes like this: you have a monotonic, non-descending sequence of ten numbers from 0 to 255. In how many ways can you choose the sequence so that its sum is exactly 1815? An example of a valid sequence is 10, 50, 100, 190, 203, 250, 250, 252, 255, 255.

The number of possible sequences is astronomical, a good approximation being 256^10, which is a 25-digit number: if you could analyze a sequence per nanosecond, it would take tens of millions of years to go through all of them! Luckily there is a good way to cut the time so much that it becomes feasible to solve the quiz even for, say, sequences of fifty numbers from 0 to 999, in a few minutes.

The magic that achieves this is called dynamic programming. The idea is to divide the possible subproblems into classes which behave in the same way with respect to the successive steps towards solving the problem. In Adok's problem, for example, it does not matter whether the first three numbers in the sequence are 10, 20, 30 or 9, 21, 30: in both cases the next number will have to be at least 30 and the running total is 60.

The complexity of the resulting algorithm is in some way proportional not to the number of solutions, but to the number of classes in which you divide them, times the steps in the problem. In Adok's problem, without any particular optimization, you have 255*10=2550 possible running totals and 255 possible values for the next number in the sequence, and you have 10 numbers in the sequence, so the complexity will be 255*10*255*10, or (255*10)^2. Exponentiation (non-polynomiality) has been turned into a product (polynomiality, hence tractability).

The drawbacks? There are essentially two. 1) You need a lot of memory, at least compared to the non-polynomial, brute-force solution (which only requires ten words of memory). 2) You don't have access to all the solutions: almost always, however, you can record the steps the algorithm took and reconstruct at least one.

But let's attack a few simpler problems before solving Adok's.

Two examples of dynamic programming

Possibly the simplest example is the same as Adok's quiz, but without the requirement of monotonicity on the sequence of random numbers. That is, in how many ways can you obtain 1815 by summing ten numbers between 0 and 255? I will phrase my description of the algorithm as if each number were obtained by rolling a 256-sided die.

Here you have 2550 classes corresponding to the running total in the algorithm. How can you reach, for example, 400 in three rolls? Well, first of all you can do so only if the running total in the first two rolls is between 400-255=145 and 400 itself. Then, if you had n ways to reach 145 in two rolls, these will be all valid to reach 400 in three rolls -- as long as you do 255 on the third roll; if you had m ways to reach 146 in two rolls, these will be all valid to reach 400 in three rolls -- as long as you do 254 on the third roll; and so on. So we have a simple program:

    #define MIN(a,b) ((a)<(b)?(a):(b))

    /* Totals run from 0 to 2550, hence 2551 elements.  */
    double m1[2551], m2[2551], *old, *new, *t;
    int i,j,k;
    
    old = m1; new = m2;

    /* Initialize the loop! We do have a way to reach 0 with zero rolls. */
    old[0] = 1.0;
    for (i = 1; i <= 2550; i++)
      old[i] = 0.0;

    for (j = 1; j <= 10; j++, t = new, new = old, old = t)
      for (i = 0; i <= 2550; i++)
        for (new[i] = 0.0, k = 0; k <= MIN(255, i); k++)
          new[i] += old[i-k];

    /* The last swap left the results of the tenth roll in old[].  */
    printf ("%g possible solutions\n", old[1815]);

The outer loop runs ten times and does exactly what was outlined in the description above: each time, the solution is recomputed taking into account an additional die roll.

Note that you can make some optimizations on the upper limits of the cycles. The most obvious is to run i only up to 255*j, because there is no way to obtain, say, 500 with a single die. The second is to avoid computing totals over 1815, since there is no way to decrease a running total. This kind of optimization will pay a lot when solving Adok's quiz, cutting run time by a factor of twelve.

Also note that I am using two arrays. This is already an optimization: conceptually you are filling a 2551x10 matrix but, since you only need the current line and the previous one, you can store them in two arrays used in a circular manner. Always keep in mind that you are working on a subset of a matrix, though: trying to get away with a single array will most likely cause bugs.

This second problem looks harder, but in fact it is not very dissimilar: instead of counting solutions we maximize a length, and the input influences which elements of the previous rows are combined. Here is the problem: you have a text and you have to find its longest palindrome subsequence (spaces included). In case you did not know, a palindrome is a sentence that reads the same from left to right and from right to left; unluckily I don't have any English examples at hand, so I'll give the rather fun Italian palindrome I topi non avevano nipoti (the mice had no nephews).

What should give us the hint to base our solution on dynamic programming is the ease of computing a solution from that of a smaller subproblem. Take the word CASSATA (which is the name of an Italian cake, in case you're interested). Considering it a letter at a time, we would say that the longest palindrome is one letter long (and it does not take much imagination to figure that out!). Considering it two letters at a time, we would find a two-letter palindrome (the two S's).

Now the fun begins: we can also find a three-letter palindrome, ATA, by taking each one-letter palindrome and checking whether it has two equal letters on its sides. But what is the longest palindrome subsequence in SATA, which has different first and last letters? Well, there are only two possibilities to reduce it to a problem of smaller size: either it is the longest palindrome in SAT, or it is the longest palindrome in ATA.

The idea then is to create a matrix of the possible subproblems, indexed by the ordinal number of the first character and by the length. Actually you will see that we only need the latest two lengths (hence three arrays as long as the string), but for now let's keep things simple: sol[i][j] will be the length of the longest palindrome subsequence of the j characters starting from the i-th in the string s.

The subdivision into classes may not be clear at first sight: here there is a class for each pair (starting position, length), that is, about n^2 of them. What we save is that we never check the palindrome-ness of each candidate subsequence one by one; each class is computed in constant time from smaller ones, so the exponential brute force (there are 2^n subsequences to check) goes down to a quadratic algorithm.

The shortness of the program's body makes it look quite surreal and mysterious...

    #define MAX(a,b) ((a)>(b)?(a):(b))

    n = strlen(s);
    for (i = 0; i < n; i++)
      sol[i][0] = 0, sol[i][1] = 1;

    for (j = 2; j <= n; j++)
      for (i = 0; i + j <= n; i++)
        sol[i][j] = s[i] == s[i+j-1]
          ? 2 + sol[i+1][j-2]
          : MAX(sol[i][j-1], sol[i+1][j-1]);

    /* The answer is in sol[0][n].  */

To cut memory consumption, just apply a modulo 3 operation to the length index.

Solving Adok's quiz

Even with only the above examples of dynamic programming algorithms, and with the short outline of the solution that I gave in the introduction, you should readily be able to understand this program. The similarity with the first program above is huge.

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    
    /* These had to be #define's to declare the arrays below
       with sizes based on their values.  */
    #define obj 1815
    #define max_dice 255
    #define max_roll 10
    
    /* This is GNU C++'s special syntax for MIN and MAX.  Maybe
       it is a bit faster than the standard way to write them.  */
    #define MIN(a,b) ((a) <? (b))
    #define MAX(a,b) ((a) >? (b))

    typedef double matrix[max_dice+1][obj+1];
    matrix odd, even;

    int
    main()
    {
      int i,j,k;
      matrix *n1, *n0;
      double cnt;

      even[0][0] = 1;
      for (i=1; i<=max_roll; i++)
        {
          if (i & 1)
            n1 = &odd, n0 = &even;
          else
            n1 = &even, n0 = &odd;

          memset (n1, '\0', sizeof (matrix));

          /* On the first iteration, we can limit ourselves to k = 0, 
             that is, all numbers can be generated.  This is because
             the mad coder's RNG has not generated any number (in
             practice, all n[k][] are still zero for k > 0). */
          for (k = (i == 1 ? 0 : max_dice); k >= 0; k--)
            for (l=max_dice; l>=k; l--)
              for (j = obj; j>=l; j--)
                (*n1) [l][j] += (*n0) [k][j-l];
        }

      for (cnt = 0, k=0; k<=max_dice; k++)
        cnt += (*n1) [k][obj];

      printf ("total             %16.0f\n", cnt);
      exit(0);
    }

In short, k is the highest number generated so far and l is the next generated number.

This program takes about half a minute on my 266 MHz Pentium II, but we can no doubt make it a lot faster (and not only because I anticipated as much above). I'll outline the optimizations and give you implementation hints, but I will not present the code until the end, both to make the article less cluttered with code and to leave you room to experiment.

First of all, we are summing a lot of zero elements. The easiest way to avoid doing so is to track the highest set element for each possible value of the last roll. For example, after eight rolls, if the next number has to be greater than or equal to 100 (so that no roll so far exceeded 100), it is not possible to have reached a running total above 800, is it?

To do so, I introduced a new array max_set[] which is kept up to date after each iteration of the j loop. If max_set[k] is n, on the next iteration max_set[l] cannot possibly be greater than n+l. This alone sped up the program to 7.5 seconds.

Now, consider the same example of reaching 800 after eight rolls. How can you possibly dream of doing 1015 more points in two rolls?!? To obtain the lowest possible totals that still allow us to reach 1815, consider that a way to achieve them is to roll a series of zeros, then the low number (1815 mod 255 = 30) that lets a final series of 255 rolls land exactly on 1815. In particular 0, 0, 30 followed by seven 255's.

Of course, this is not yet the correct way to prune the set: we only obtained a lower bound on the running total, not on the last generated number k. It does not take too much IQ (quote from Aretha Franklin...) to see that the smallest k that can produce a given running total is obtained by rolling identical numbers. You can precompute these values by hand or have the program do the job for you. Time went down to 3 seconds.

For just a little more performance squeezing, consider the opposite case. I have rolled nine dice and got to 1809. True, I did not bust the limit of 1815, but to get there at least one roll -- hence the highest so far -- had to be at least 1809/9=201, so my last roll will be 201 or more. No way am I going to score less than 2010.

In general, if I am about to roll an l and still have 10-i rolls left, I will score at least max = l * (10-i) more points -- so I don't want to compute running totals above 1815-max, because they give me no chance to land exactly on 1815.

Coded it? If not (out of laziness undoubtedly, since the job is actually not so hard), here is the program. To do the timings, fiddle with the definition of ALGORITHM (note that I actually coded the algorithm backwards, doing the optimizations by hand first and then, seeing that the timing wasn't that bad, undoing them to double-check the result; this is reflected in the program comments).

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    
    /* These had to be #define's to declare the arrays below
       with sizes based on their values.  */
    #define obj 1815
    #define max_dice 255
    #define max_roll 10
    
    /* This is GNU C++'s special syntax for MIN and MAX.  Maybe
       it is a bit faster than the standard way to write them.  */
    #define MIN(a,b) ((a) <? (b))
    #define MAX(a,b) ((a) >? (b))
    
    /* This is the fastest algorithm.  */
    #ifndef ALGORITHM
    #define ALGORITHM 1
    #endif
    
    typedef double matrix[max_dice+1][obj+1];
    matrix odd, even;
    int max_set[max_dice+1];
    
    int
    main()
    {
      int i,j,k,l,min_random,min,max,min_iter,max_iter;
      matrix *n1, *n0;
      double cnt;
    
      const int max_zero = (max_dice * max_roll - obj) / max_dice;
      const int first_min = obj % max_dice;
    
      even[0][0] = 1;
      min = min_random = 0;
      for (i=1; i<=max_roll; i++)
        {
          if (i & 1)
            n1 = &odd, n0 = &even;
          else
            n1 = &even, n0 = &odd;
    
          memset (n1, '\0', sizeof (matrix));
    
          /* The lowest totals that allow us to reach obj are achieved by
             rolling a series of zero, then obj % max_dice, and then
             a series of 255.  This is the lowest because we use
             the minimum possible value until we have to use 255's.
             min contains the minimum running total at the end of
             this roll. */
          if (i == max_zero + 1)
            min = first_min;
          else if (i > max_zero)
            min += max_dice;
    
          printf ("roll #%d, min %d min_random %d\n", i, min, min_random);
    
          /* On the first iteration, we can limit ourselves to k = 0, 
             that is, all numbers can be generated.  This is because
             the mad coder's RNG has not generated any number (in
             practice, all n[k][] are still zero for k > 0). */
          for (k = (i == 1 ? 0 : max_dice); k >= min_random; k--)
            {
              for (l=max_dice; l>=k; l--)
                {
    #if ALGORITHM == 1
                  /* The idea behind this algorithm is that if we still
                     have to roll max_roll-i times, we will do at least
                     l * (max_roll - i) points.  So we must not be over
                     max, otherwise we will surely get well over obj.  */
                  max = obj - l * (max_roll - i);
                  max_iter = MIN(max, max_set[k] + l);
                  min_iter = MAX(min, l);
    #endif
    
    
    #if ALGORITHM == 2
                  /* This algorithm is a variation of the above.  It does
                     not compute exactly the same pruned sets, but this
                     has bigger sets rarely enough that the timing is
                     about the same for these two algorithms.  The set is
                     different if max_set[k] <= max <= max_set[k] + l.  */
                  max = obj - l * (max_roll - i + 1);
                  max_iter = MIN(max, max_set[k]) + l;
                  min_iter = MAX(min, l);
    #endif
    
    
    #if ALGORITHM == 3
                  /* This is a simple pruning based on recording which
                     elements of n[][k] were already set, and on removing
                     values that are too low to possibly reach obj.  */
                  max_iter = MIN(obj, max_set[k] + l);
                  min_iter = MAX(min, l);
    #endif
    
    
    #if ALGORITHM == 4
                  /* Just to double check, I omitted the pruning of the
                     bottom elements here.  It gave the same result as
                     the previous algorithms.  */
                  max_iter = MIN(obj, max_set[k] + l);
                  min_iter = l;
    #endif
    
    
    #if ALGORITHM == 5
                  /* And for a definitive check, I omitted the pruning of
                     unset (still zero) elements here.  Again, the result
                     did not change.  */
                  max_iter = obj;
                  min_iter = l;
    #endif
    
                  for (j = max_iter; j>=min_iter; j--)
                    (*n1) [l][j] += (*n0) [k][j-l];
    
                  if (max_iter > max_set[l]) max_set[l] = max_iter;
                }
             }
    
          /* If, so far, our running total had to be at least min,
             it might have been reached by doing all equal rolls.
             This would have allowed us to achieve it with the lowest
             minimum generated value, min_random, for the next
             iteration.  */
          min_random = (min + i - 1) / i;
        }

      for (cnt = 0, k=0; k<=max_dice; k++)
        cnt += (*n1) [k][obj];

      printf ("total             %16.0f\n", cnt);
      exit(0);
    }

Caching

A sidenote, which I am including because just after I solved this quiz I was asked a question on caching in a university exam. The question was to compute the number of cache misses when accessing an array by rows or by columns -- I already knew that you should access arrays by rows, but the exact answer was quite dramatic: if the array does not fit in the cache, and say that four items of the array fit in a cache line, accessing by columns gives you four times as many cache misses.

Now, it turns out that I had by mistake turned my arrays around when I coded the solution to Adok's quiz. Before preparing this article, I decided to polish the code and noticed this fact. Well, the time to compute the solution on a 266 MHz PII fell from three seconds to one in the optimized version, and from well over two minutes to 25 seconds in the non-optimized one. This is just in case you thought that taking caching effects into account was only needed to squeeze out the last bit of performance.

(The palindrome loop above also has bad cache locality, because it accesses its array by columns, but the dataset should fit even in the L1 cache.)

Dynamic programming and probabilistic problems

Another interesting class of games which are well tackled through dynamic programming is those involving probability. Take this: a strange guy wants to have a friend for each possible birth date. How many friends will he have to meet, on average, before he achieves his objective? It might not look so hard: the first friend will surely have a new birthday; the second will have one with probability 364/365 (hence, on average, 365/364 friends must be met to know people with two distinct birth dates); the third new birth date will require meeting 365/363 people, and so on. The total will be 365*(1+1/2+1/3+...+1/365). Right?

Wrong.

Why?

Because there are leap years.

Right, leap years. Knowing a guy born on February 29th will be hard, at least in principle (if you happen to have one in your demo group, well, that's not the average!). But how much harder? Don't try to compute a closed formula that gives you the answer. One exists, but it takes a hell of a lot of summations and related simplifications to extract it. Dynamic programming is a better solution.

Let's tackle the easy problem from a more generic point of view first. Ever heard about Markov chains? Basically, they are a kind of finite state automaton in which passing from one state to another happens based on a set of probabilities. They can be put in a matrix or a tensor (a matrix of vectors or matrices), and that's how they are usually taught, but don't worry about that: just think of them as probabilistic transitions between states.

So, what are the probabilities in the easy problem? When you have met people with n different birth dates, and hence you are in state n, you can go to state n+1 with probability (365-n)/365, else you stay there. On average, going from state 0 to state n will require this many friends:

    T(n+1) = (365-n)/365 * (1 + T(n)) + n/365 * (1 + T(n+1))

that is, a single friend plus all those you met so far if you're lucky, and a wasted friend plus all those you need on average if you are not. Simplifying (solving for T(n+1)) we have:

    T(n+1) = T(n) + 365 / (365 - n)

And iterating this we can compute the solution to the simple game.

In the leap year case, our state space has two dimensions. We say we are in state (m,0) if we have met nobody born on February 29th and people born on m different ordinary dates; and we are in state (m,1) if we did make our lucky meeting. The probability table is:

    (1460-4*m)/1461        for (m,n)->(m+1,n)
    (4*m+n)/1461           for (m,n)->(m,n)
    (1-n)/1461             for (m,n)->(m,1-n)

We can separate the cases n=0 and n=1:

    (1460-4*m)/1461        for (m,0)->(m+1,0)
    (4*m)/1461             for (m,0)->(m,0)
    1/1461                 for (m,0)->(m,1)

    (1460-4*m)/1461        for (m,1)->(m+1,1)
    (4*m+1)/1461           for (m,1)->(m,1)

So (I wrote the equation a bit differently from that above, but the idea is the same):

    T(m,0) = (1460-4*m)/1461 * (T(m+1,0) - 1)
           +      (4*m)/1461 * (T(m,0) - 1)
           +          1/1461 * (T(m,1) - 1)

    T(m,1) = (1460-4*m)/1461 * (T(m+1,1) - 1)
           +    (4*m+1)/1461 * (T(m,1) - 1)

This system of two equations can be solved quite easily as well:

    T(m,1) = T(m+1,1) - 1461 / (1460-4*m)
    T(m,0) = (1460-4*m)/(1461-4*m) * T(m+1,0)
           +          1/(1461-4*m) * T(m,1)
           -       1461/(1461-4*m)

or we can solve for T(m+1,n) (which is more suited to computing T(365,1) iteratively):

    T(m+1,1) = T(m,1) + 1461 / (1460-4*m)

    T(m+1,0) =         -1/(1460-4*m) * T(m,1)
             + (1461-4*m)/(1460-4*m) * T(m,0)
             +       1461/(1460-4*m)

Whew! The formulas were not particularly easy, but at least it was possible to arrive at a solution without any deep reasoning.

Now try this (I don't remember the exact numeric solution, but to check your program: the answer should grow like log2 n as you increase n). You have 100 coins in a box. Shake it, remove the heads, close the box, and repeat. After how many shakes, on average, will you have emptied the box? (Hint: the probability of getting k heads with n coins is n!/(k!(n-k)!2^n).)

Conclusion

Dynamic programming is not just for quizzes. For example, there is a dynamic programming solution to the knapsack problem, which used to be the basis for a public-key cryptosystem different from RSA, abandoned after it was shown not to offer enough guarantees of unbreakability. And some compilers use dynamic programming for their instruction selection phase, in particular to convert expression trees into assembly-language instructions under the assumption that infinitely many registers are available; the algorithm is remarkably simple and works well not only for RISC instruction sets, but also for CISC instruction sets that, like the 386, are decently orthogonal and exhibit few instructions with side effects (like pre- or post-increments).

Young coders might ask their school to sponsor participation in the International Olympiad in Informatics. A lot of its tasks require a good knowledge of dynamic programming.

Finally, Linux-based coders should (in case they aren't doing that yet) take a look at cachegrind, a great program that instruments your code on-the-fly (without even needing recompilation if you have an executable with only debugging information) to produce line-by-line statistics about its L1 and L2 cache behavior.

--
|_  _  _ __
|_)(_)| ),'
        `-._